Skip to content

refactor: contain image converter file paths within root_path#11229

Open
etairl wants to merge 2 commits intodeepset-ai:mainfrom
etairl:sandbox-image-utils-document-paths
Open

refactor: contain image converter file paths within root_path#11229
etairl wants to merge 2 commits intodeepset-ai:mainfrom
etairl:sandbox-image-utils-document-paths

Conversation

@etairl
Copy link
Copy Markdown

@etairl etairl commented Apr 30, 2026

Summary

  • _extract_image_sources_info (in haystack/components/converters/image/image_utils.py) was joining a Document's file_path metadata field with root_path via Path(root_path, file_path) and then calling open() on the result.
  • Absolute paths in metadata short-circuit the join (e.g. Path('/data', '/etc/passwd') == Path('/etc/passwd')) and ../ segments resolve outside the directory. Any caller that propagates document metadata from less-trusted sources therefore ends up reading files outside the configured root.
  • This PR rejects absolute paths up-front, resolves the joined path, and verifies that the result is the configured root_path or a descendant of it before opening the file.

Used by, among others, LLMDocumentContentExtractor and DocumentToImageContent. The behavior for documents whose file_path is a normal relative path inside root_path is unchanged.

Test plan

  • CI runs the existing image-utils / DocumentToImageContent / LLMDocumentContentExtractor unit tests.
  • Manual: a document with meta['file_path'] = 'subdir/image.png' resolves and loads as before.
  • Manual: a document with meta['file_path'] = '/etc/passwd' raises ValueError (absolute path).
  • Manual: a document with meta['file_path'] = '../../../etc/passwd' raises ValueError (escapes root).

🤖 Generated with Claude Code

_extract_image_sources_info joined a Document's caller-supplied
``file_path`` metadata with ``root_path`` via ``Path(root_path, file_path)``
without any containment check. Absolute paths in metadata short-circuit
the join and ``..`` segments resolve outside the directory, so any
caller that propagates document metadata from less-trusted sources can
end up reading files outside the configured root.

Reject absolute file_path values up-front, resolve the joined path, and
verify the result is the configured root or a descendant of it. Raise
ValueError with a clear message when these checks fail. Behavior for
documents with relative paths inside root_path is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@etairl etairl requested a review from a team as a code owner April 30, 2026 10:23
@etairl etairl requested review from davidsbatista and removed request for a team April 30, 2026 10:23
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 30, 2026

@etairl is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 30, 2026

CLA assistant check
All committers have signed the CLA.

@anakin87
Copy link
Copy Markdown
Member

See #11226 (review)

Release-note linter rejects single backticks for inline code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants